# College dataset

__Description__

Statistics for a large number of US colleges from the 1995 issue of US News and World Report.

Dimensions: 777 x 18

[Short description of variables (appendix)](#Short-description-of-variables)

__Sources__

This dataset was taken from the StatLib library, which is maintained at Carnegie Mellon University. The dataset was used in the ASA Statistical Graphics Section's 1995 Data Analysis Exposition.

__References__

This dataset is part of the course material of the [book](https://www.statlearning.com/) ___Introduction to Statistical Learning with R___ (Ch 02 - Statistical Learning - Applied Exercises - Problem 8).
---

### Index

- [Short description of variables (appendix)](#Short-description-of-variables)
- [2.8a - Import data](#2.8a---Import-data)
  - [Preliminary_Observations](#Preliminary_Observations)
- [2.8b - Data preparation](#2.8b---Data-preparation)
- [2.8c - Data exploration](#2.8c---Data-exploration)
  - [2.8c.1 - Summary statistics](#2.8c.1---Summary-statistics)
  - [2.8c.2 - Scatterplot matrix](#2.8c.2---Scatterplot-matrix)
    - [Observations - Pairplots](#Observations_Pairplots)
    - [Correlation](#Correlation)
  - [2.8c.3 - Boxplot](#2.8c.3---Boxplot)
    - [Observations - Outstate v Private](#Observations_-_Outstate_~_Private)
  - [2.8c.4 - Elite](#2.8c.4---Elite)
    - [Observations](#Observations_-_Elite)
  - [2.8c.5 - Histograms](#2.8c.5---Histograms)
    - [a) Student expenditure related variables](#a%29-Student-expenditure-related-variables)
      - [Observations](#Observations_-_Student)
    - [b) Faculty and student related ratios](#b%29-Faculty-and-student-related-ratios)
      - [Observations](#Observations_-_Faculty_Student_Ratios)
  - [2.8c.6 - Further data exploration](#2.8c.6---Further-data-exploration)
    - [a) Spending patterns - private vs non-private](#a%29-Spending-patterns---private-vs-non-private)
      - [Observations](#Observations_-_Student_spending)
    - [b) Most sought after college/university](#b%29-Most-sought-after-college/university)
      - [Most Sought-after Colleges/Universities (Final list)](#Most-Sought-after-Colleges/Universities-(Final-list%29)
    - [c) Further analysis of Most sought-after colleges/univeristies](#c%29-Further-analysis-of-Most-sought-after-colleges/univeristies)
      - [Observations](#Observations_-_MSA)
    - [d) Top 20 colleges by applications](#d%29-Top-20-colleges-by-applications)
      - [Observations](#Observations_-_Top_Apps)
    - [e) Further analysis of Elite colleges](#e%29-Further-analysis-of-Elite-colleges)
      - [Observations](#Observations_-_Top_Elite)
- [Code help sources](#Code-help-sources)

###### 
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

    # Import requisite packages
    import os
    import time
    import numpy as np
    import pandas as pd
    from statistics import *
    from scipy.stats import kurtosis, skew
    import matplotlib.pyplot as plt
    import seaborn as sns
    %matplotlib inline
    # pd.options.display.float_format = '{:,.3f}'.format

    def sns_pars(title=13, label=12, font=10):
        sns.set_context(rc={"axes.titlesize":title,"axes.labelsize":label,"font.size":font})

    from IPython.core.interactiveshell import InteractiveShell

###### ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

    file_dir = r"E:\Data Science\Statistics\Intro to Statistical Learning with R"
    data_path = os.path.join(file_dir,'datasets','College.csv')
    os.path.exists(data_path)

    college = pd.read_csv(data_path)
    print(college.shape)
    college.head()

    ## Alternatives
    ## 1
    # os.chdir(r"E:\Data Science\Statistics\Intro to Statistical Learning with R")
    # college = pd.read_csv('./datasets/College.csv')
    ## 2
    # url = "https://statlearning.com/College.csv"
    # college = pd.read_csv(url)
    # college.head()

    college.info()

    ## Missing values in data
    # count (total number of missing cells)
    college.isna().sum().sum()
    # any missing data
    # college.isna().any().any()

    ## Columns with missing values
    # count
    college.isna().any().sum()
    # column list with missing data
    # college.columns[college.isna().any()].values.tolist()
    # columns with NAs
    # college.loc[:, college.isna().any()]

    ## Rows with missing values
    # count (rows with at least one missing cell)
    college.isna().any(axis=1).sum()
    # row list with missing data
    # [idx for idx, el in zip(college.index, college.isnull().any(axis=1)) if el == True]
    # rows
    # college[college.isna().any(axis=1)]

<div class="alert 
alert-block alert-info"><a id='Preliminary_Observations'></a><b>Preliminary observations:</b><br>
 - No missing values.<br>
 - College/university names are currently stored as a data column. They will be moved into the index so that they are not treated as data.<br>
 - The categorical variable 'Private' is currently stored as object type. It will be converted to category.</div>

###### ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

    # Create index from college names and drop the column
    college = college.set_index('Unnamed: 0', append=True)
    # Set index title
    college.rename_axis(index=[None,'College'], inplace=True)
    # Change nature of 'Private' to categorical
    college['Private'] = college['Private'].astype('category')
    # Confirm changes
    print(college.dtypes)
    college.head(1)

###### ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

    # Numerical features
    college.describe(percentiles=[0.05,0.25,0.75,0.95])

    # Categorical and character features
    college.describe(include=['object','category'])

    # Frequency tables
    df_cat = college.select_dtypes('category')
    for i in df_cat.columns:
        print(i,'\n','-'*15, sep='')
        print(df_cat[i].value_counts())
        print('-'*30)

###### ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

    # Select variables for pairs plot
    selected_vars = [c for c in college.columns if c not in
                     ('F.Undergrad P.Undergrad Books S.F.Ratio perc.alumni').split()]
    print(selected_vars)
    len(selected_vars)

    # Scatterplot
    plt.rc('figure', 
           facecolor='w')
    plt.rcParams.update({'font.size':10,'axes.labelsize':18})
    sns.pairplot(college[selected_vars]);
    sns.reset_orig()

<div class="alert alert-block alert-info"><a id='Observations_Pairplots'></a><b>Tentative observations:</b><br>
 - Private colleges have a much higher 'Expend' (instructional expenditure per student).<br>
 - There is a moderately positive relationship between Top10perc (the share of new students from the top 10% of their high school class) and the Outstate tuition charged.<br></div>

##### '#####################################################################

    # High correlation relations
    # Correlation of numeric columns (the categorical 'Private' column is excluded explicitly)
    corr = college.select_dtypes('number').corr().round(3)
    # Display |correlations| >= 0.55
    corr[abs(corr) < 0.55] = '-'
    corr

##### #'#####################################################################

    ## Correlation plot
    # Generate a mask for the upper triangle
    mask_up = np.triu(np.ones_like(corr, dtype=bool))
    f, ax = plt.subplots(figsize=(13, 10))
    sns.heatmap(college.select_dtypes('number').corr(), annot=True, mask=mask_up, fmt='.2f', cmap="YlGnBu",
                annot_kws={"size":11,"alpha":1}, cbar = False, square= True, linewidths=0.01,
                cbar_kws={"shrink": .5});
    plt.title("Heatmap", fontsize=15)
    plt.ylabel("", fontsize=14)
    plt.yticks(fontsize=12);

    # reset plot parameters to defaults
    sns.reset_orig()

###### ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

    with sns.plotting_context():
        plt.gcf().set_size_inches(12,3)
        plt.gca().set(frame_on=False)
        sns.boxplot(y='Private', x='Outstate', data=college, width=0.5)
        plt.title('Outstate tuition - Private v Non-Private');

<div class="alert alert-block alert-info"><a id='Observations_-_Outstate_~_Private'></a><b>Observations:</b><br>
 - Out-of-state tuition is generally higher at private institutions than at non-private ones.</div>

###### 
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

### 2.8c.4 - Elite

Elite >> universities with Top10perc > 50%
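As a minimal sketch of the rule above, the flag can be built in one step with `np.where`. The `toy` frame and its values are hypothetical stand-ins, not rows from `College.csv`; only the column names `Top10perc`, `Private` and `Elite` follow this notebook.

```python
import numpy as np
import pandas as pd

# Toy data: hypothetical colleges, not rows from College.csv
toy = pd.DataFrame(
    {"Top10perc": [23, 51, 50, 76, 12],
     "Private": ["Yes", "Yes", "No", "Yes", "No"]},
    index=["A", "B", "C", "D", "E"],
)

# Strictly greater than 50, so a college at exactly 50 is not Elite
toy["Elite"] = pd.Categorical(
    np.where(toy["Top10perc"] > 50, "Yes", "No"),
    categories=["No", "Yes"],
)

elite_counts = toy["Elite"].value_counts()
elite_by_private = pd.crosstab(toy["Elite"], toy["Private"])
print(elite_counts)
print(elite_by_private)
```

Fixing the category order up front keeps "No" and "Yes" in the same positions in every table and plot, even if one level happens to be empty in a subset.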
    # Glance at Top10perc
    f, axes = plt.subplots(1, 2, sharex=False, figsize=(15,4), gridspec_kw={"width_ratios":(.35,.65)})
    sns.histplot(college['Top10perc'], ax=axes[0])
    sns.boxplot(x=college['Top10perc'], y=college['Private'], width=0.3, ax=axes[1])
    f.suptitle('Top10perc', fontsize=15, ha='right',va='bottom')
    axes[0].set_title('Top10perc Histogram')
    axes[1].set_title('Top10perc Distribution - Private v Non-Private')
    sns.despine(left=True, bottom=True, right=True, top=True);

    # Create new variable 'Elite' >> Top10perc > 50
    college.loc[college['Top10perc'] > 50, 'Elite'] = 'Yes'
    college['Elite'] = college['Elite'].fillna('No')
    # Ensure type is category (by default it would be saved as 'object')
    college['Elite'] = college['Elite'].astype('category')
    print(college['Elite'].dtype)
    college.sample(n=3)

    # No. of Elite universities/colleges
    college['Elite'].value_counts()

    # Elite ~ Private (Contingency table)
    pd.crosstab(college.Elite, college.Private, margins=False)

    # Outstate tuition - Elite
    with sns.plotting_context():
        sns_pars(15,12,10)
        plt.gcf().set_size_inches(15,4)
        plt.gca().set(frame_on=False)
        plt.title('Outstate tuition v Elite')
        sns.boxplot(x=college['Outstate'], y=college['Elite'], width=0.3);

<div class="alert alert-block alert-info"><a id='Observations_-_Elite'></a><b>Observations:</b><br>
- 83% (65 of 78) of the 'Elite' institutions are private.<br>
- The distribution of Outstate tuition in Elite universities is heavily left-skewed (values are concentrated at the high end with a tail towards lower tuition), indicating that most of the 'Elite' institutions charge high out-of-state tuition.<br>
- The median Outstate tuition in Elite institutions is much higher than in non-Elite institutions, pointing to a clear difference in affordability for out-of-state students.</div>

###### 
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

### 2.8c.5 - Histograms

Freedman-Diaconis method has been used for calculating bin-widths.
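For reference, the Freedman-Diaconis rule sets the bin width to $h = 2 \cdot \mathrm{IQR} / n^{1/3}$, where IQR is the interquartile range and $n$ the number of observations. The sketch below computes it by hand on synthetic data (a hypothetical stand-in for a column such as `Outstate`, not the real values) and compares the resulting bin count with NumPy's built-in `'fd'` estimator.

```python
import math
import numpy as np

# Synthetic stand-in for a numeric column; not the actual College data
rng = np.random.default_rng(0)
x = rng.normal(loc=10_000, scale=4_000, size=777)

# Freedman-Diaconis bin width: h = 2 * IQR * n^(-1/3)
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
h = 2.0 * iqr * x.size ** (-1.0 / 3.0)

# Number of equal-width bins needed to cover the data range
n_bins = math.ceil((x.max() - x.min()) / h)

# NumPy applies the same rule when bins='fd'
edges = np.histogram_bin_edges(x, bins="fd")
print(f"h = {h:.1f}, bins by hand = {n_bins}, bins from numpy = {len(edges) - 1}")
```

Because the rule uses the IQR rather than the standard deviation, a few extreme values widen the data range (and so add bins) without inflating the bin width itself.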
###### a) Student expenditure related variables

    # Total student expenditure (with and without Outstate tuition)
    # Student exp = Room.Board + Books + Personal + Outstate
    college['TExp.without.O'] = college[['Room.Board','Books','Personal']].sum(axis=1)
    college['TExp.with.O'] = college[['Outstate','Room.Board','Books','Personal']].sum(axis=1)

    "a) Student expenditure related variables"
    # selected variables
    var = ('Outstate Room.Board Books Personal TExp.without.O TExp.with.O Expend').split(' ')
    xlabels = var
    ylabels = ['Freq'] * len(var)
    # determine subplot shape
    from math import ceil
    n = len(var)
    cls = 3
    rws = ceil(n/cls)
    # plot
    f, axes = plt.subplots(rws, cls, figsize=(17, 4*rws))
    r = 0
    c = 0
    for i in range(len(var)):
        if cls > 1 and rws > 1:
            plt.sca(axes[r,c])
        elif rws == 1:
            plt.sca(axes[c])
        elif cls == 1:
            plt.sca(axes[r])
        col = var[i]
        plt.gca().set(frame_on=False)
        sns.histplot(college[col].values).set(title=col, xlabel='', ylabel='')
        f.subplots_adjust(bottom=-0.1)
        med = median(college[col])
        min_ylim, max_ylim = plt.ylim()
        plt.vlines(x=med, ymin=0, ymax=max_ylim, colors='k', ls=':', lw=2)
        plt.text(med, max_ylim*1.01, 'Median : '+str(med), fontsize=10, ha='center', va='bottom')
        if c == cls-1:
            c = 0
            r += 1
            continue
        else:
            c += 1
    # delete extra grids
    extra_grids = (rws)*(cls) - n
    for k in range(cls-extra_grids, cls):
        f.delaxes(axes[r,k]);

<div class="alert alert-block alert-info"><a id='Observations_-_Student'></a><b>Observations:</b><br>
- Out-of-state tuition and Room.Board expenses are slightly positively skewed.<br>
- Expenditure on 'Books', 'Personal' expenses of students and 'Expend' (instructional exp per student) are positively skewed.<br>
 - Median Total Expenditure with Outstate tuition is \$16,079. 
<br>  Median household income in the same year (1995) as per the <a href="https://www.census.gov/library/publications/1996/demo/p60-193.html">US Census</a> was ≈ \$34,000.</div>

    # boxplots - Student expenses with and without Outstate tuition
    with sns.plotting_context():
        sns_pars(15,15,12)
        plt.gcf().set_size_inches(17,4)
        plt.gca().set(frame_on=False)
        sns.boxplot(y="variable", x="value", data=pd.melt(college.iloc[:,-2:]), width=0.3)
        plt.title('Student expenses with and without Outstate tuition')
        plt.xlabel('Student Expenses')
        plt.ylabel('')

    # Skewness and Kurtosis
    pd.DataFrame({'skew':college[var].skew(axis=0), 'kurt':college[var].kurtosis(axis=0)}).T

###### ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

    "b) Faculty and student related ratios"
    # selected variables
    var = ('PhD, Terminal, S.F.Ratio, perc.alumni, Grad.Rate').split(', ')
    xlabels = var
    ylabels = ['freq']*len(var)
    # determine subplot shape
    from math import ceil
    n = len(var)
    cls = 3
    rws = ceil(n/cls)
    # plot
    f, axes = plt.subplots(rws, cls, figsize=(17, 4*rws))
    r = 0
    c = 0
    for i in range(len(var)):
        if cls > 1 and rws > 1:
            plt.sca(axes[r,c])
        elif rws == 1:
            plt.sca(axes[c])
        elif cls == 1:
            plt.sca(axes[r])
        col = var[i]
        plt.gca().set(frame_on=False)
        sns.histplot(college[col].values).set(title=col, xlabel='', ylabel='')
        f.subplots_adjust(bottom=-0.1)
        med = median(college[col])
        min_ylim, max_ylim = plt.ylim()
        min_xlim, max_xlim = plt.xlim()
        plt.vlines(x=med, ymin=0, ymax=max_ylim, colors='k', ls=':', lw=2)
        plt.text(med, max_ylim*1.01, 'median : '+str(med), fontsize=10, ha='center', va='bottom')
        if c == cls-1:
            c = 0
            r += 1
            continue
        else:
            c += 1
    # delete extra grids
    extra_grids = (rws)*(cls) - n
    for k in range(cls-extra_grids, cls):
        f.delaxes(axes[r,k]);

<div class="alert alert-block alert-info"><a 
id='Observations_-_Faculty_Student_Ratios'></a><b>Observations:</b><br>
- PhD and Terminal are heavily left-skewed, i.e. at most institutions a large share of the faculty holds a PhD or terminal degree.<br>
   PhD has one bin > 100. Since PhD is a percentage, this is likely a data error.<br>
- Few colleges have a student-faculty ratio > 20.<br>
- Graduation rates vary widely, with 17.63% of institutions having graduation rates below 50%.<br>
 - For the middle 50% of institutions (Q1 to Q3), the share of alumni who donate lies between 13% and 31%.<br>
 <b>Note</b>: see workings below for calculations<br> </div>
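The checks behind these observations can be generalised a little. The sketch below uses a hypothetical helper, `out_of_range`, and made-up values (not taken from `College.csv`) to flag percentage columns that fall outside 0-100 and to compute the share of rows below a threshold.

```python
import pandas as pd

# Toy data: hypothetical colleges with deliberately invalid percentages
toy = pd.DataFrame(
    {"PhD": [72, 103, 88], "Grad.Rate": [60, 45, 118]},
    index=["College A", "College B", "College C"],
)

def out_of_range(df, cols, lo=0, hi=100):
    """Map each column to the row labels whose values fall outside [lo, hi]."""
    return {c: df.index[(df[c] < lo) | (df[c] > hi)].tolist() for c in cols}

flags = out_of_range(toy, ["PhD", "Grad.Rate"])

# The mean of a boolean mask is the proportion of True values
share_below_50 = (toy["Grad.Rate"] < 50).mean() * 100

print(flags)
print(round(share_below_50, 2))
```

Whether flagged rows should be capped at 100, dropped, or left alone is a judgment call that depends on how the variable was recorded; the helper only surfaces them.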
    college.index[college['PhD'] > 100].tolist()

    ## Graduation rate
    college['Grad.Rate'].describe().to_frame().T

    ## No. of institutions with graduation rate < 50%
    sum(college['Grad.Rate'] < 50)

    ## % of institutions with graduation rate < 50%
    pd.Series(college['Grad.Rate'] < 50).value_counts(normalize=True)*100

    college['perc.alumni'].describe().to_frame().T

###### ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

##### a) Spending patterns - private vs non-private

    # Proportion of Private
    g = college.groupby('Private')['Personal']
    pd.concat([g.size().rename('Count'),
               (g.size() / g.size().sum()).rename('Proportion'),
               g.median().rename('Median')], axis=1)

    # No. of institutions with Personal > 1649 (median of non-private)
    g.apply(lambda x: len(x[x>1649]))

    # Histograms
    plt.subplots(1, 2, figsize=(10, 4))
    plt.subplot(121)
    sns.histplot(x=college['Personal'][college.Private == 'Yes']).set(title='Personal in Private')
    plt.subplot(122)
    sns.histplot(x=college['Personal'][college.Private != 'Yes']).set(title='Personal in Non-Private');

    # Boxplots
    with sns.plotting_context():
        plt.gcf().set_size_inches(10,4)
        sns.boxplot(x='Personal', y='Private', data=college, width=0.3)

<div class="alert alert-block alert-info"><a id='Observations_-_Student_spending'></a><b>Observations:</b><br>
- The distribution of Personal spending is highly positively skewed in private institutions and moderately positively skewed in non-private ones.<br>
 - Median personal spending by students in non-private (\$1649) is higher than in private (\$1100).<br>
- The number of institutions where Personal spending is > \$1649 (the non-private median) is nearly the same for private (103) and non-private (106).</div>

###### 
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------xxxxxxxxxx##### b) Most sought after college/universityInstitutions with high - Top10perc- Top25perc- Apps- No. of Apps per Enroll- No. of Enroll per AcceptInstitutions with high
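One way to combine several "high on X" criteria is to take the top-N set for each criterion and keep only the institutions present in every set. The sketch below illustrates the idea on made-up data; the helper `top_n`, the column `apps_per_enroll` and all values are hypothetical, and the cut-off of 3 is arbitrary.

```python
import pandas as pd

# Toy data: hypothetical colleges, not rows from College.csv
toy = pd.DataFrame(
    {"Top10perc": [90, 80, 40, 85, 30],
     "Apps":      [9000, 2000, 8000, 7000, 500],
     "Enroll":    [1000, 900, 2500, 800, 300]},
    index=["A", "B", "C", "D", "E"],
)
toy["apps_per_enroll"] = toy["Apps"] / toy["Enroll"]

def top_n(df, col, n):
    """Return the index labels of the n largest values in column col."""
    return set(df[col].nlargest(n).index)

criteria = ["Top10perc", "Apps", "apps_per_enroll"]

# Keep only the colleges that rank in the top 3 on every criterion
shortlist = set.intersection(*(top_n(toy, c, 3) for c in criteria))
print(sorted(shortlist))
```

The size of the final list is sensitive to the cut-offs chosen for each criterion, so it is worth checking how the shortlist changes when N is varied.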
    ## Selected factors
    var = ('Apps Accept Enroll Top10perc Top25perc F.Undergrad').split(' ')
    # Creating subset of dataframe
    admission_data = college[var].copy()
    ## Calculating ratios
    # Application ratio
    admission_data['apr'] = round(admission_data.Apps / admission_data.Enroll * 100).astype('int')
    # Acceptance rate
    admission_data['acr'] = round(admission_data.Accept / admission_data.Apps * 100).astype('int')
    # Enrollment rate
    admission_data['enr'] = round(admission_data.Enroll / admission_data.Accept * 100).astype('int')
    admission_data.info()

    # Top 100 colleges by Top10perc
    top100_Top10perc = admission_data.sort_values(by=['Top10perc','College'], axis=0, ascending=[False,True])[0:100]
    print(top100_Top10perc.shape)
    top100_Top10perc.head(10)

    # Top 100 colleges by Top25perc
    top100_Top25perc = admission_data.sort_values(by=['Top25perc','Top10perc'], ascending=[False,False])[0:100]
    top100_Top25perc.head(10)

    # Top 300 colleges by no. of applications
    top300_Apps = admission_data.sort_values(by=['Apps','College'], ascending=[False,False])[0:300]
    top300_Apps[0:5]

    # Top 300 colleges by application rate (Apps / Enroll)
    top300_apr = admission_data.sort_values(by=['apr','College'], ascending=[False,False])[0:300]
    top300_apr.head(5)

    # Top 300 colleges by enrollment rate (Enroll / Accept)
    top300_enr = admission_data.sort_values(by=['enr','College'], ascending=[False,False])[0:300]
    top300_enr[0:5]

    # Common
    # List of all lists (named all_lists to avoid shadowing the built-in all())
    all_lists = [top100_Top10perc.index.get_level_values('College').tolist(),
                 top100_Top25perc.index.get_level_values('College').tolist(),
                 top300_Apps.index.get_level_values('College').tolist(),
                 top300_enr.index.get_level_values('College').tolist(),
                 top300_apr.index.get_level_values('College').tolist()]
    from functools import reduce
    common = list(reduce(lambda i, j: set(i).intersection(set(j)), all_lists))
    len(common)

###### Most Sought-after Colleges/Universities (Final list)

    # Most Sought-after Colleges / 
    # Universities
    msal = top100_Top10perc.loc[pd.IndexSlice[:,common], :]
    msal

###### ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

##### c) Further analysis of Most sought-after colleges/univeristies

    msa = college.loc[msal.index]
    msa.Private.value_counts()
    InteractiveShell.ast_node_interactivity = "all"

    college.describe(percentiles=[.05,.10,.25,.5,.75,.9,.95]).astype('int')
    # use the full option name; the short form 'max_columns' can match more than one option
    pd.set_option('display.max_columns',None)
    msa

<div class="alert alert-block alert-info"><a id='Observations_-_MSA'></a><b>Observations:</b><br>
 - 14 of the 16 (87.5%) MSAs (most sought-after institutions) are private.<br>
 - 15 (93.75%) have a Grad.Rate above the 90th percentile; the remaining one, with an 83% graduation rate, is at about the 85th percentile.<br>
 - Outstate tuition for 13 (81.25%) MSAs is at or above the 90th percentile.<br> </div>

    ## Percentile table
    from scipy.stats import percentileofscore as ptile
    subset_df = msa
    base_df = college
    var_main = 'Top10perc'
    var = 'Grad.Rate'
    n = len(subset_df.index)
    k = 90
    # Percentiles
    percentiles_main = [round(ptile(base_df[var_main].values,i,'weak'),2) for i in subset_df[var_main].values]
    percentiles = [round(ptile(base_df[var].values,i,'weak'),2) for i in subset_df[var].values]
    percentile_df = pd.DataFrame(np.column_stack([subset_df.index.get_level_values('College'), subset_df[var], percentiles, percentiles_main]))
    percentile_df.columns = ['College', var, 'Percentile', 'Percentile_Main']
    # Count below and above kth percentile
    print('<',k,': ',sum(percentile_df['Percentile']<k),'('+str(sum(percentile_df['Percentile']<k)/n*100)+')')
    print('>=',k,': ',sum(percentile_df['Percentile']>=k),'('+str(sum(percentile_df['Percentile']>=k)/n*100)+')')
    percentile_df.sort_values(by=['Percentile','Percentile_Main'], ascending=False)

    ## Boxplots - Overall distribution v subset distribution
    def applyBoxPlotStyle(fig, b=True): 
        sns.despine(left=True, bottom=b, top=True, right=True)
        plt.setp(fig.lines, color='k', linewidth=0.5)
        plt.setp(fig.artists, edgecolor='k', linewidth=0.5)
    fig, (ax1, ax2) = plt.subplots(nrows=2, sharex=True, figsize=(12,2), constrained_layout=True)
    ax1 = plt.subplot(211)
    sns.boxplot(x=subset_df[var], width=0.3, color='thistle', fliersize=3).set(title=var)
    applyBoxPlotStyle(ax1)
    plt.axis('off')
    ax2 = plt.subplot(212, sharex=ax1)
    box = sns.boxplot(x=base_df[var], width=0.4, color='cadetblue', fliersize=3)
    ax2.tick_params(left=False)
    ax2.set(xlabel='')
    applyBoxPlotStyle(ax2)
    ## Histograms - Overall and subset
    plt.subplots(1, 2, figsize=(10, 3))
    plt.subplot(121)
    plt.box(False)
    sns.histplot(x=base_df[var]).set(title=var+'\n(Overall)')
    plt.subplot(122)
    sns.histplot(x=subset_df[var]).set(title=var+'\n(Subset)')
    plt.box(False)
    plt.subplots_adjust(bottom=-0.3)
    plt.tight_layout()
    ## Scatterplot - var_main vs var
    plt.subplots(1, 2, figsize=(10, 3))
    plt.subplot(121)
    sns.scatterplot(x=base_df[var_main], y=base_df[var]).set(title=var_main+' v '+var+'\n(Overall)')
    plt.box(False)
    plt.subplot(122)
    sns.scatterplot(x=subset_df[var_main], y=subset_df[var]).set(title=var_main+' v '+var+'\n(Subset)')
    plt.box(False)
    plt.subplots_adjust(bottom=-0.3)
    plt.tight_layout();

    ## Statistical test for comparing base and subset
    # Anderson-Darling test for comparing 2 samples
    from scipy.stats import anderson_ksamp
    anderson_ksamp([base_df[var], subset_df[var]], midrank=True)
    # Kolmogorov-Smirnov test
    from scipy.stats import ks_2samp
    ks_2samp(base_df[var], subset_df[var])

    InteractiveShell.ast_node_interactivity = "last_expr"

##### '#####################################################################

    import warnings
    def fxn():
        warnings.warn("UserWarning arose", UserWarning)

    ## Boxplots
    subset_df = msa.select_dtypes('number')
    base_df = college
    ovars = [c for c in subset_df.columns if c in base_df]
    n = len(ovars)
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        fxn()
        for var in ovars: 
            fig, (ax1, ax2) = plt.subplots(nrows=2, sharex=True, figsize=(12,2), facecolor='w')
            sns_pars(13,14,9)
            adt = anderson_ksamp([base_df[var],subset_df[var]], midrank=True)
            kst = ks_2samp(base_df[var],subset_df[var], 'two-sided')
            plt.figtext(0, 1, 'KS test: p-val ='+str(kst[1].round(5)), fontsize=10)
            plt.figtext(0, 0.9, 'AD test: sig-lvl ='+str(round(adt[2],5)), fontsize=10)
            ax1 = plt.subplot(211)
            sns.boxplot(x=subset_df[var], width=0.4, color='thistle', fliersize=3)
            applyBoxPlotStyle(ax1)
            plt.axis('off')
            ax1.set(title = var)
            ax2 = plt.subplot(212, sharex=ax1)
            box = sns.boxplot(x=base_df[var], width=0.4, color='cadetblue', fliersize=3)
            applyBoxPlotStyle(ax2,False)
            ax2.set(xlabel=None)
            ax2.tick_params(left=False)
            fig.subplots_adjust(bottom=0.3);
    sns.reset_orig()

##### #'#####################################################################

    ## Statistical test comparing variables and their subset
    test_df = pd.DataFrame(np.zeros([n,3]),columns=['KS (p-value)','AD (min sig lvl)','Significant'], index=subset_df.columns)
    with warnings.catch_warnings():
        warnings.simplefilter('ignore')
        fxn()
        for var in list(subset_df.columns):
            # Kolmogorov-Smirnov test (significant if p-value <= 0.05)
            kst = ks_2samp(base_df[var],subset_df[var], 'two-sided')
            ksp = kst[1]
            ksps = ksp <= 0.05
            # Anderson-Darling test (significant if the statistic exceeds the 5% critical value)
            adt = anderson_ksamp([base_df[var],subset_df[var]], midrank=True)
            adsl = adt[2]
            adss = adt[0] > adt[1][2]
            # Add to df: 'Y' if both tests agree on a difference, 'N' if neither does, '-' if they disagree
            sig = 'Y' if (ksps and adss) else 'N' if (not ksps and not adss) else '-'
            test_df.loc[var] = [round(ksp,5),round(adsl,5),sig]
    test_df

###### ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

    InteractiveShell.ast_node_interactivity = "all"

    ## Top 20 colleges by no. 
    # of applications
    top20_Apps = college.sort_values(by='Apps', ascending=False)[0:20]
    # Both frames share the same (row number, College) index, so an index join is sufficient
    top20_Apps = top20_Apps.join(admission_data[['apr','acr','enr']], how='left')
    college.describe(percentiles=[0.05,0.1,0.25,0.5,0.75,0.9,0.95]).astype('int')
    top20_Apps

<div class="alert alert-block alert-info"><a id='Observations_-_Top_Apps'></a><b>Observations:</b><br>
 - 18 of the top 20 colleges by number of applications are non-private.<br>
 - Institutions with high applications (HAIs) also have high acceptance numbers.<br>
 - High applications are also accompanied by high enrollment numbers, but the distribution of enrollment rates for HAIs is not very different from the overall distribution of enrollment rates. <br>
 This suggests that although applications are high, not many applicants go on to enroll. Many applications could be backup applications.<br>
 - The distribution for HAIs differs significantly from that of the overall sample for all variables except Outstate, Room.Board, Books and Personal.<br><br>
<i>Note: See workings below.</i></div>

    ## Percentile table
    subset_df = top20_Apps
    base_df = college
    var_main = 'Apps'
    var = 'Accept'
    n = len(subset_df.index)
    k = 90
    # Percentiles
    percentiles_main = [round(ptile(base_df[var_main].values,i,'weak'),2) for i in subset_df[var_main].values]
    percentiles = [round(ptile(base_df[var].values,i,'weak'),2) for i in subset_df[var].values]
    percentile_df = pd.DataFrame(np.column_stack([subset_df.index.get_level_values('College'), subset_df[var], percentiles, percentiles_main]))
    percentile_df.columns = ['College', var, 'Percentile', 'Percentile_Main']
    # Count below and above kth percentile
    print('<',k,': ',sum(percentile_df['Percentile']<k),'('+str(sum(percentile_df['Percentile']<k)/n*100)+')')
    print('>=',k,': ',sum(percentile_df['Percentile']>=k),'('+str(sum(percentile_df['Percentile']>=k)/n*100)+')')
    percentile_df.sort_values(by=['Percentile','Percentile_Main'], ascending=False)

    ## Boxplots - Overall distribution v subset 
    # distribution
    fig, (ax1, ax2) = plt.subplots(nrows=2, sharex=True, figsize=(12,2), constrained_layout=True)
    ax1 = plt.subplot(211)
    sns.boxplot(x=subset_df[var], width=0.3, color='thistle', fliersize=3).set(title=var)
    applyBoxPlotStyle(ax1)
    plt.axis('off')
    ax2 = plt.subplot(212, sharex=ax1)
    box = sns.boxplot(x=base_df[var], width=0.4, color='cadetblue', fliersize=3)
    ax2.tick_params(left=False)
    ax2.set(xlabel='')
    applyBoxPlotStyle(ax2)
    ## Histograms - Overall and subset
    plt.subplots(1, 2, figsize=(10, 3))
    plt.subplot(121)
    plt.box(False)
    sns.histplot(x=base_df[var]).set(title=var+'\n(Overall)')
    plt.subplot(122)
    sns.histplot(x=subset_df[var]).set(title=var+'\n(Subset)')
    plt.box(False)
    plt.subplots_adjust(bottom=-0.3)
    plt.tight_layout()
    ## Scatterplot - var_main vs var
    plt.subplots(1, 2, figsize=(10, 3))
    plt.subplot(121)
    sns.scatterplot(x=base_df[var_main], y=base_df[var]).set(title=var_main+' v '+var+'\n(Overall)')
    plt.box(False)
    plt.subplot(122)
    sns.scatterplot(x=subset_df[var_main], y=subset_df[var]).set(title=var_main+' v '+var+'\n(Subset)')
    plt.box(False)
    plt.subplots_adjust(bottom=-0.3)
    plt.tight_layout();

    ## Statistical test for comparing base and subset
    # Anderson-Darling test for comparing 2 samples
    from scipy.stats import anderson_ksamp
    anderson_ksamp([base_df[var], subset_df[var]], midrank=True)
    # Kolmogorov-Smirnov test
    from scipy.stats import ks_2samp
    ks_2samp(base_df[var], subset_df[var])

    InteractiveShell.ast_node_interactivity = "last_expr"

##### #'#####################################################################

    ## Boxplots
    subset_df = top20_Apps.select_dtypes('number')
    base_df = college
    ovars = [c for c in subset_df.columns if c in base_df]
    n = len(ovars)
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        fxn()
        for var in ovars:
            fig, (ax1, ax2) = plt.subplots(nrows=2, sharex=True, figsize=(12,2), facecolor='w')
            sns_pars(13,14,9)
            adt = anderson_ksamp([base_df[var],subset_df[var]], midrank=True)
            kst = ks_2samp(base_df[var],subset_df[var], 
                           'two-sided')
            plt.figtext(0, 1, 'KS test: p-val ='+str(kst[1].round(5)), fontsize=10)
            plt.figtext(0, 0.9, 'AD test: sig-lvl ='+str(round(adt[2],5)), fontsize=10)
            ax1 = plt.subplot(211)
            sns.boxplot(x=subset_df[var], width=0.4, color='thistle', fliersize=3)
            applyBoxPlotStyle(ax1)
            plt.axis('off')
            ax1.set(title = var)
            ax2 = plt.subplot(212, sharex=ax1)
            box = sns.boxplot(x=base_df[var], width=0.4, color='cadetblue', fliersize=3)
            applyBoxPlotStyle(ax2,False)
            ax2.set(xlabel=None)
            ax2.tick_params(left=False)
            fig.subplots_adjust(bottom=0.3);
    sns.reset_orig()

##### #'#####################################################################

    ## Statistical test comparing variables and their subset
    test_df = pd.DataFrame(np.zeros([n,3]),columns=['KS (p-value)','AD (min sig lvl)','Significant'], index=ovars)
    with warnings.catch_warnings():
        warnings.simplefilter('ignore')
        fxn()
        ###### Statistical test
        for var in list(ovars):
            # Kolmogorov-Smirnov test
            kst = ks_2samp(base_df[var],subset_df[var], 'two-sided')
            ksp = kst[1]
            ksps = ksp <= 0.05
            # Anderson-Darling test
            adt = anderson_ksamp([base_df[var],subset_df[var]], midrank=True)
            adsl = adt[2]
            adss = adt[0] > adt[1][2]
            # Add to df
            sig = 'Y' if (ksps and adss) else 'N' if (not ksps and not adss) else '-'
            test_df.loc[var] = [round(ksp,5),round(adsl,5),sig]
    test_df

###### ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

    InteractiveShell.ast_node_interactivity = "all"

    ## Elite colleges >> Top10perc > 50
    elite_df = college.loc[college.Elite == 'Yes',:].droplevel(0, axis=0) # drop index level 0
    elite_df = elite_df.sort_values(by=['Top10perc','Top25perc'],ascending=[False,False])
    # create zipped list of tuples with (range(n) + college names)
    tuples = list(zip(list(range(len(elite_df))),elite_df.index.tolist()))
    # replace index with multi-index made of 
    # tuples
    elite_df.index = pd.MultiIndex.from_tuples(tuples, names=['SN','College'])
    elite_df.shape
    ## Top 20 elite colleges
    top20_elite = elite_df[0:20]
    college.describe(percentiles=[0.05,0.1,0.25,0.5,0.75,0.9,0.95]).astype('int')
    top20_elite

<div class="alert alert-block alert-info"><a id='Observations_-_Top_Elite'></a><b>'Elite'</b> : Colleges/universities where more than 50% of new students are Top10perc students.<br>
<b>'Top10perc'</b> : % of new students from top 10% of their High School class<br><br>
<b>Observations:</b><br>
 - There are 78 (10% of 777) 'Elite' institutions.<br>
 - The distribution of every variable in 'Elite' colleges differs from the variable's overall distribution, except in the case of 'Books' and 'Personal'.<br><br>
 <b>Top 20 Elite:</b><br>
 - Unsurprisingly, the top 'Elite' institutions also have the highest proportion of students that graduated in the top 25% of their high schools. All 20 are at or above the 97th percentile of <b>Top25perc</b>.<br>
 - The minimum <b>'PhD'</b> and <b>'Terminal'</b> proportions are 91% and 92% respectively. 
<br> - 19 out of 20 institutions have faculty with <b>'PhD'</b>s within the top 90th percentile.<br> - 18 out of 20 institutions have faculty with <b>'Terminal'</b> degrees within the top 90th percentile.<br> - 16 of the 20 have <b>out-of-state tuition</b> among the top 90th percentile, with California-Irvine and California-Berkely being the notable outliers with 69th and 66th percentile respectively, and Georgia Institute of Technology being an extreme outlier with 16th percentile.<br> - 70% of the Top 20 Elite have <b>Room.Board</b> expenses among the top 85th percentile.<br> - <b>Student-faculty ratio</b> is generally lower than overall, with 14 of the top 20 having S.F.Ratio below the 25th percentile.<br>   Here again California-Irvine, California-Berkely and Georgetown Institute of Technology stand out with 73th, 71st and 91st percentiles respectively.<br> - 15 of the 20 are among the top 80th percentile in terms of proportion of alumni that donate (<b>perc.alumni</b>).<br> - <b>Graduation rates</b> are higher than the norm among the Elite institutions with 16 of the top 20 'Elite' having Grad.Rates above the 93rd percentile.</div>xxxxxxxxxx## Percentile table"Top 20 elite colleges"subset_df = top20_elitebase_df = collegevar_main = 'Top10perc'var = 'PhD'n = len(subset_df.index)k = 90# Percentilespercentiles_main = [round(ptile(base_df[var_main].values,i,'weak'),2) for i in subset_df[var_main].values]percentiles = [round(ptile(base_df[var].values,i,'weak'),2) for i in subset_df[var].values]percentile_df = pd.DataFrame(np.column_stack([subset_df.index.get_level_values('College'), subset_df[var], percentiles, percentiles_main]))percentile_df.columns = ['College', var, 'Percentile', 'Percentile_Main']# Count below and above kth percentileprint('<',k,': ',sum(percentile_df['Percentile']<k),'('+str(sum(percentile_df['Percentile']<k)/n*100)+')')print('>=',k,': 
',sum(percentile_df['Percentile']>=k),'('+str(sum(percentile_df['Percentile']>=k)/n*100)+')')percentile_df.sort_values(by=['Percentile','Percentile_Main'], ascending=False)xxxxxxxxxx## Boxplots - Overall distribution v subset distributionfig, (ax1, ax2) = plt.subplots(nrows=2, sharex=True, figsize=(12,2), constrained_layout=True)ax1 = plt.subplot(211)sns.boxplot(x=subset_df[var], width=0.3, color='thistle', fliersize=3).set(title=var)applyBoxPlotStyle(ax1)plt.axis('off')ax2 = plt.subplot(212, sharex=ax1)box = sns.boxplot(x=base_df[var], width=0.4, color='cadetblue', fliersize=3)ax2.tick_params(left=False)ax2.set(xlabel='')applyBoxPlotStyle(ax2)min_ylim, max_ylim = plt.ylim()plt.vlines(x=np.percentile(base_df[var], [5,10,90,95]), ymin=min_ylim, ymax=max_ylim, colors='k', ls=':', lw=0.8)## Histograms - Overall and subsetplt.subplots(1, 2, figsize=(10, 3))plt.subplot(121)plt.box(False)sns.histplot(x=base_df[var]).set(title=var+'\n(Overall)')plt.subplot(122)sns.histplot(x=subset_df[var]).set(title=var+'\n(Subset)')plt.box(False)plt.subplots_adjust(bottom=-0.3)plt.tight_layout()## Scatterplot - var_main vs varplt.subplots(1, 2, figsize=(10, 4))plt.subplot(121)sns.scatterplot(x=base_df[var_main], y=base_df[var]).set(title=var_main+' v '+var+'\n(Overall)')plt.figtext(0.02, 0.95, 'Corr:'+str(base_df[var_main].corr(base_df[var]).round(2)), fontsize=11)plt.box(False)plt.subplot(122)sns.scatterplot(x=subset_df[var_main], y=subset_df[var]).set(title=var_main+' v '+var+'\n(Subset)')plt.figtext(0.52, 0.95, 'Corr: '+str(subset_df[var_main].corr(subset_df[var]).round(2)), fontsize=11)plt.box(False)plt.subplots_adjust(bottom=-0.3)plt.tight_layout();xxxxxxxxxx## Statistical test for comparing base and subset# Anderson-Darling test for comparing 2 samplesfrom scipy.stats import anderson_ksampanderson_ksamp([base_df[var], subset_df[var]], midrank=True)# Kolmogorov-Smirnov testfrom scipy.stats import ks_2sampks_2samp(base_df[var], 
subset_df[var])xxxxxxxxxxInteractiveShell.ast_node_interactivity = "last_expr"xxxxxxxxxx##### #'#####################################################################xxxxxxxxxx## Boxplotssubset_df = top20_elite.select_dtypes('number')base_df = collegeovars = [c for c in subset_df.columns if c in base_df]n = len(ovars)with warnings.catch_warnings(): warnings.simplefilter("ignore") fxn() for var in ovars: fig, (ax1, ax2) = plt.subplots(nrows=2, sharex=True, figsize=(12,2), facecolor='w') sns_pars(13,14,9) adt = anderson_ksamp([base_df[var],subset_df[var]], midrank=True) kst = ks_2samp(base_df[var],subset_df[var], 'two-sided') plt.figtext(0, 1, 'KS test: p-val ='+str(kst[1].round(5)), fontsize=10) plt.figtext(0, 0.9, 'AD test: sig-lvl ='+str(round(adt[2],5)), fontsize=10) ax1 = plt.subplot(211) sns.boxplot(x=subset_df[var], width=0.4, color='thistle', fliersize=3) applyBoxPlotStyle(ax1) plt.axis('off') ax1.set(title = var) ax2 = plt.subplot(212, sharex=ax1) box = sns.boxplot(x=base_df[var], width=0.4, color='cadetblue', fliersize=3) applyBoxPlotStyle(ax2,False) ax2.set(xlabel=None) ax2.tick_params(left=False) fig.subplots_adjust(bottom=0.3) min_ylim, max_ylim = plt.ylim() plt.vlines(x=np.percentile(base_df[var], [5,10,90,95]), ymin=min_ylim, ymax=max_ylim, colors='k', ls=':', lw=0.8);sns.reset_orig()xxxxxxxxxx##### #'####################################################################### Statistical test comparing variables and their subsettest_df = pd.DataFrame(np.zeros([n,3]),columns=['KS (p-value)','AD (min sig lvl)','Significant'], index=ovars)with warnings.catch_warnings(): warnings.simplefilter('ignore') fxn() for var in list(ovars): # Kolmogrov-Smirnov test kst = ks_2samp(base_df[var],subset_df[var], 'two-sided') ksp = kst[1] ksps = ksp <= 0.05 # Anderson-Darling test adt = anderson_ksamp([base_df[var],subset_df[var]], midrank=True) adsl = adt[2] adss = adt[0] > adt[1][2] # Add to df sig = 'Y' if (ksps and adss) else 'N' if (not ksps and not adss) else '-' 
test_df.loc[var] = [round(ksp,5),round(adsl,5),sig]test_dfxxxxxxxxxx##### #'#####################################################################xxxxxxxxxx##### Correlation matrix - Subset correlation and change in correlation## Correlation matrix - Subset correlation and change in correlationmask_up = np.triu(np.ones_like(subset_df.corr(), dtype=bool))mask_down = np.tril(np.ones_like(base_df.corr(), dtype=bool))f, ax = plt.subplots(figsize=(9, 9))sns.heatmap(subset_df.corr(), annot=True, mask=mask_up, fmt='.2f', cmap="YlGnBu", annot_kws={"size":10,"alpha":1}, cbar = False, linewidths=0.01, square=True)sns.heatmap(subset_df.corr()-base_df.corr(), annot=True, mask=mask_down, fmt='.2f', cmap="BuPu", annot_kws={"size":10,"alpha":1}, cbar = False, linewidths=0.01, square=True)plt.title("Subset corr - Overall corr", fontsize=12)plt.ylabel("Subset Correlation", fontsize=12)plt.yticks(fontsize=11);xxxxxxxxxx###### ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------xxxxxxxxxx### Code help sources[List of sources for the book](https://github.com/rahul-ahuja1/Solutions_-_An_Introduction_to_Statistical_Learning#solutions---an-introduction-to-statistical-learning)xxxxxxxxxx## Short description of variablesStatistics for a large number of US Colleges from the 1995 issue of US News and World Report. 
[Return to Index](#Index)

• <b>Private</b> : Public/private indicator
• <b>Apps</b> : Number of applications received
• <b>Accept</b> : Number of applicants accepted
• <b>Enroll</b> : Number of new students enrolled
• <b>Top10perc</b> : New students from top 10% of high school class
• <b>Top25perc</b> : New students from top 25% of high school class
• <b>F.Undergrad</b> : Number of full-time undergraduates
• <b>P.Undergrad</b> : Number of part-time undergraduates
• <b>Outstate</b> : Out-of-state tuition
• <b>Room.Board</b> : Room and board costs
• <b>Books</b> : Estimated book costs
• <b>Personal</b> : Estimated personal spending
• <b>PhD</b> : Percent of faculty with Ph.D.’s
• <b>Terminal</b> : Percent of faculty with terminal degree
• <b>S.F.Ratio</b> : Student/faculty ratio
• <b>perc.alumni</b> : Percent of alumni who donate
• <b>Expend</b> : Instructional expenditure per student
• <b>Grad.Rate</b> : Graduation rate
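
The 18 variables above are the raw columns; the 'Elite' flag and the percentile figures quoted in the observations are derived from them. The sketch below illustrates both derivations on made-up numbers using only the standard library. The helper names `is_elite` and `weak_percentile` and the sample values are hypothetical, and it is an assumption (not confirmed by the notebook) that `ptile(..., 'weak')` behaves like a "share of values less than or equal to the score" percentile rank.

```python
# Hypothetical sample: Top10perc values for a handful of made-up colleges.
top10perc = {
    "College A": 23,
    "College B": 51,
    "College C": 78,
    "College D": 50,
    "College E": 12,
}


def is_elite(top10, threshold=50):
    """Return 'Yes' when more than `threshold` percent of new students
    come from the top 10% of their high school class, else 'No'."""
    return "Yes" if top10 > threshold else "No"


def weak_percentile(values, score):
    """Percentage of `values` that are less than or equal to `score`
    (the 'weak' definition of a percentile rank)."""
    values = list(values)
    return 100.0 * sum(v <= score for v in values) / len(values)


# Derive the Elite flag and the percentile rank of each college's Top10perc.
elite = {name: is_elite(v) for name, v in top10perc.items()}
ranks = {name: weak_percentile(top10perc.values(), v)
         for name, v in top10perc.items()}

for name in top10perc:
    print(name, elite[name], ranks[name])
```

Note that a value of exactly 50 is not flagged, because the rule is a strict "greater than 50"; with the 'weak' definition, the largest value in the sample always has a percentile rank of 100.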